Video conference audio mixing algorithm and its implementation

With the rapid development of the interconnected technology, the data ploughs that are spread on the interconnected blades are gradually increasing. The video conferencing system came into being, the communication and communication of the å¨„ ,, the use of listening and medullary music, the transmission of voice became the most important indicator of the performance of the most video conferencing system. At the same time, the mixing algorithm in the video conferencing system is very important.
The problem that the terminal must solve for the processing of the arpeggio signal is how the multi-channel anhydride first data is mixed and played locally. It will be synchronized by itself. Delay, synchronization with video and other effects, in actual applications, the overflow of the sound card buffer after mixing is the biggest problem.
Here is a description of the mixing structure, /A~ WJ improved mixing algorithm. In the quality of the mix, the ship, the stagnation rate, the delay and the scalability, and the existing mixing algorithm J comparison. The experimental results show that the characteristics of the knotting TAå¨„ speech of the algorithm, according to the needs of the small scene, while suppressing the mix overflow III, the quality of the harsh mix is â€‹â€‹reduced, the delay is reduced, and it has good practical potential.
1, mixing algorithm analysis
A pressure wave generated by the vibration of an object. Loudness, pitch and tone are the three most important features of sound. In nature, the voice heard by the human ear comes from the superposition of four and eight sounds. For the video conferencing system, the prefix data from the other side needs to be mixed in the time domain. The sampling and quantification of the speech signal are placed on the sound card chip. The commonly used sound card is 16 bits, and the sputum viscosity is 16 bit. In many operating systems such as Linux, the data class of the sound card buffer is usually .iLmed short, ranging from -32768 to 32767. After multi-channel mixing, the amplitude may be too small for the sound card to accept the distortion of the sound. The above several common solutions.
(1) Direct clamp method
After mixing, the speech intensity exceeds the buffer data class and is replaced by the maximum value. In this way, the ç˜ position will cause the artificial waveform clipping of the signal waveform, and the same call that destroys the characteristics of the voice signal will promote the noise.
(2) uniformization and mixing
The averaged mix is â€‹â€‹superimposed after the different way, and divided by the number of sticks to ensure a small overflow after mixing. However, as the number of mixing channels increases, when a plurality of mixing sounds are simultaneously sounded at the same time, the voice signals from either side will be multi-channelly divided, resulting in the smallest sound and small energy recognition.
(3) Align the mix
It can be said that the foot mean value is mixed with the å¨ˆ type, where the upper part is strongly aligned and weakly aligned. In the strong alignment, a larger mixing weight is given to the mixing circuit with a larger sound intensity, and the original voice has a larger vocal path, and the disadvantage is that the smaller H} sound is submerged. Weak alignment gives a larger mix weight to a mix that has a lower sound intensity. In this way, the first smallest mixing path is enlarged, and the jump point is also enlarged to the Jr background noise.
Although the above algorithms are simple, they all have the problem of deleting and mixing the best sound quality. Next, we introduce a new and improved mixing scheme and algorithm.
2. Improved mixing algorithm
In the SIP-based video conferencing system, there are various ways of constructing according to the different types of signaling and media mixing. 2 In terms of the media stream mixing mode, there is a centralized mixing and terminal mixing.
Here, the distributed mixing model of Figure 1 is designed. Compared with the centralized mixing, the server does not process the media stream, but only the management strategy of the conference system. The terminal receives the voice data that has been distributed, and after the decoding process, the mixing is started. This mode is mixed into the dagger package of the terminal itself, and the shadow of the echo is small. On the whole, such a stick sound system is moderately complex, which can alleviate the pressure on the conference server. Delays are less concentrated than centralized mixing, and performance is much more stressful for systems that use real-time applications.
Figure 1 Distributed Mixing Model
Designed for video conferencing systems for SMEs and schools, the number of participants is less than 5 people. Participants in the conference are less likely to speak together, and the rate of strong spills is relatively small. Because the speech signal has a short-term correlation, that is, a frame as described here. This time is usually between 10 ms and 30 ms. In the design of the mixing algorithm, both overflow and smoothing are considered, and the algorithm is as follows.
1 initialization attenuation factor f _see is 1;
2 The letter soap in the unified data, including the absolute value of the maximum peak value, the shortest time within one frame can pass the zero rate,
3 Compare the maximum absolute peak value with the width of the i6 bin number, and judge that it is a tongue-splitting mountain. If y she finds the appropriate attenuation ç½”r and updates (f see-Ma/sos, Max is the absolute value of the maximum rm value, s is the product of the maximum absolute peak ä½° and the day f frame attenuation field f), and the latest The maximum reduction r is multiplied by the frame data, and the sword sound card buffer is output.
4 If there is no sail, the attenuation factor is dynamically changed according to the training of short-term maximum and zero-crossing rate. And use the latest attenuation factor to output the frame data to the sound card buffer;
5 Read the next frame and perform step 2 again.
For the determination of the overflow, the attenuation a nÂ± used in the literature [4] is multiplied by each sample of each frame and the attenuation. In the commonly used fixed-point processors, more multiplication and division will consume CPU resources and delay. Feel free to change the relative size of the sample to cause distortion of the mixed J speech. This ltrXil performs the processing of the sound signal according to the frame, and it will process the content of the voice. Through the smoothing process, the portamento information of the frame is reduced according to the side, and the speech feature parameters are
After the sample overflows and is attenuated, a mechanism is needed to effectively compensate. Here, 10^ first attenuates the iR-chemical principle, maps the attenuation ff see to 320 equal parts, and uses the most ppp of the animal to express it, ie f_se e-Tppn 20, ppp is 320 multiplied by f see After the whip sticks. Because the R C17ci ~i female is used in the speech codec, the A law captures the expansion, enhances the precision of the small signal, and the training of the small signal is more than the J semi-rich state. Here, the upper and lower limits of the increase and decrease of ppp are set to 160 to 320. In the process of stripping and mixing, Kenken is placed in the small signal sensitive part, avoiding the rough signal and adjusting to the comfort of the human ear.
The normalized attenuation, the increase or decrease of the yaw speech signal, is judged by the zero-crossing rate, and the short-term comparison of the adjacent koji is the shortest. In the 4th step action, the training of the ç½”r is attenuated. Refer to the short-day of the voice vocal in step 2 - the most likely to pass the zero rate. According to the experience in the root cup document [7], the value of the zero-crossing rate of the iO ms frame is 4. "I æ¦† æ¦† æ¦† çš„ åº„ åº„ åº„ åº„ åº„ åº„ åº„ åº„ åº„ åº„ åº„ åº„ ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²— ç²—Calling the gradual increase of the attenuation network f, "the short squeak of the voice can increase the gradual reduction of the attenuation field Jf-. This kind of attenuation of the field j' will be abolished and the signal will be aggravated and robust. Sex.
Designed as a video conferencing system for SMEs and schools, the number of participants is less than five. It is less likely that the conference will participate in the green common skills, and the rate of strong stagnation will be relatively small. Gang has a short-term correlation for speech signals. The time is usually between io ms and 30 ms, which is the frame that comes with it. Here, in the discussion of the mixing algorithm called the é•’ é•’ å¹³ æŸ‘ æŸ‘ ,, and in the voice frame unit, the algorithm 'flow is as follows:
3, embedded implementation and results analysis
In order to compare the performance of the improved mixing algorithm, the written mixing algorithm is tested in the embedded environment in which the video conferencing system is actually running. The processor is TI's DaVinci series of system-on-chip DM6446-594 heterogeneous dual-core processor. The mixing algorithm is run on the ARM side. The ARM core is the ARM9 series and runs at 297M. A total of 3 voices are collected here, all of which are pure background noise, all the way to the human voice, the sound intensity is close to the overflow position, the other is the human voice, and the sound intensity is moderate. Both vocals use male voices that are more difficult to distinguish. The test results are shown in Figures 2, 3 and 4.
It can be seen from the time domain waveform diagram after mixing, and Figure 2 shows the mixing process using the simple overflow attenuation algorithm. As the mixing process progresses, the volume gradually decreases. When a peak of mixing occurs at a certain moment, it causes a very small mitigating factor, and the volume becomes small and cannot be recovered. Figure 3 shows the method of increasing the attenuation factor by the fixed step size in the reference [5]. This mixing method keeps the volume near the maximum value, and the sound is harsh and the noise is strong. It increases the rate of overflow for the next mix, increases overflow detection and new sag factor calculations, consumes resources, and introduces latency. Figure 4 shows the improved algorithm here, which uses frame-by-frame attenuation, which is different from the sample-by-sample processing in [5], which reduces the multiplication and division operations of fixed-point processors and improves computational performance. The sag factor is normalized and subdivided, and the upper and lower limits are set. The short-term energy and zero-crossing rate are used to recognize and dynamically update the attenuation factor. The mix sounds smooth, with little noise, no plosives, and no spillover after mixing.
4, the conclusion
Here, the attenuation factor method based on the characteristics of the speech signal and the speech frame is solved to solve the problem of the mixing overflow, and the algorithm is improved to improve the performance of the overflow processing. Moreover, the short-time energy and short-time zero-crossing rate are used to perform coarse detection and attenuation compensation on vocal speech, which improves the mixing quality. Users can choose between two algorithms according to the network environment. The performance and effect of the algorithm are proved by the implementation on the fixed-point processor ARM9 and the result analysis.

Plug-in Type Cassette PLC Splitter
PLC Planar Waveguide Optical Splitter (PLC Splitter) is an integrated waveguide optical power distribution device based on a quartz substrate. Like coaxial cable transmission systems, optical network systems also need to couple, branch, and distribute optical signals. Need Optical Splitter to realize.
Plc Splitter is one of the most important passive components in optical fiber links. It is an optical fiber tandem device with multiple input ends and multiple output ends. It is especially suitable for connecting the central office and the central office in passive optical networks (EPON, GPON, BPON, etc.) Terminal equipment and realize the branch of the optical signal.
The advantages for PLC Splitter:
(1) Loss is not sensitive to the wavelength of the transmitted light, and can meet the transmission needs of different wavelengths.
(2) The light is evenly split, and the signal can be evenly distributed to users.
(3) The structure is compact and the volume is small, and it can be directly installed in various existing transfer boxes without special design to leave a lot of installation space.
(4) There are many shunt channels for a single device, which can reach more than 32 channels.
(5) The multi-channel cost is low, and the more the number of branches, the more obvious the cost advantage.

Cassette Plc Splitter,Plug-In Plc,Plug-In Type Cassette Plc Splitter,Optic Fiber Splitter
Shenzhen GL-COM Technology CO.,LTD. , https://www.szglcom.com